Using Disk Throughput Data in Predictions of End-to-End Grid Data Transfers
Abstract
Data Grids provide an environment for communities of researchers to share, replicate, and manage access to copies of large datasets. In such environments, fetching data from one of several replica locations requires accurate predictions of end-to-end transfer times. Predicting transfer time is complicated because the end-to-end data path involves several shared components, such as networks and disks, each of which experiences load variations that can significantly affect throughput. Of these, disk accesses are rapidly growing in cost and have not previously been considered, although on some machines they can account for up to 30% of the transfer time. In this paper, we present techniques to combine observations of end-to-end application behavior with disk I/O throughput load data. We develop a set of regression models to derive predictions that characterize the effect of disk load variations on file transfer times. We also include network component variations and apply these techniques to logs of transfers made with the GridFTP server, part of the Globus Toolkit™. We observe up to 9% improvement in prediction accuracy compared with approaches based on past system behavior in isolation.
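As a rough illustration of the general idea of regressing transfer performance on disk load, the sketch below fits a one-variable least-squares line and uses it to estimate a transfer time. It is not the paper's model: the function names `fit_linear` and `predict_transfer_time`, the single-predictor linear form, and all numbers are assumptions made up for illustration.

```python
from statistics import mean


def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def predict_transfer_time(file_size_mb, disk_mb_per_s, model):
    """Estimate transfer time in seconds from observed disk throughput."""
    a, b = model
    bandwidth = a + b * disk_mb_per_s  # predicted end-to-end MB/s
    if bandwidth <= 0:
        raise ValueError("model predicts non-positive bandwidth")
    return file_size_mb / bandwidth


# Hypothetical, made-up observations (not data from the paper):
# disk throughput (MB/s) paired with achieved transfer bandwidth (MB/s).
disk_throughput = [10.0, 20.0, 30.0, 40.0]
transfer_bandwidth = [3.0, 5.0, 7.0, 9.0]

model = fit_linear(disk_throughput, transfer_bandwidth)
estimate = predict_transfer_time(600.0, 25.0, model)
print(f"intercept={model[0]:.2f}, slope={model[1]:.2f}, time={estimate:.1f} s")
```

A practical predictor would presumably add further regressors, such as network load measurements, and would be validated against held-out transfer logs; this sketch omits both.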
Related resources
Using Regression Techniques to Predict Large Data Transfers
The recent proliferation of Data Grids and the increasingly common practice of using resources as distributed data stores provide a convenient environment for communities of researchers to share, replicate, and manage access to copies of large datasets. This has led to the question of which replica can be accessed most efficiently. In such environments, fetching data from one of the several rep...
Predicting Sporadic Grid Data Transfers
The increasingly common practice of (1) replicating datasets and (2) using resources as distributed data stores in Grid environments has led to the problem of determining which replica can be accessed most efficiently. Due to diverse performance characteristics and load variations of several components in the end-to-end path linking these various locations, selecting a replica location from am...
PhEDEx high-throughput data transfer management system
Distributed data management at LHC scales is a staggering task, accompanied by equally challenging practical management issues with storage systems and wide-area networks. The CMS data transfer management system, PhEDEx, is designed to handle this task with minimum operator effort, automating the workflows from large-scale distribution of HEP experiment datasets down to reliable and scalable transfe...
Cluster-to-cluster data transfer with data compression over wide-area networks
The recent emergence of ultra-high-speed networks of up to 100 Gb/s has posed numerous challenges and has led to many investigations into efficient protocols to saturate 100 Gb/s links. However, end-to-end data transfers involve many components, not only protocols, that affect overall transfer performance. These components include the disk I/O subsystem, additional computation associated with data streams...
DotDFS: A Grid-based high-throughput file transfer system
The DotGrid platform is a Grid infrastructure integrated with a set of open and standard protocols, recently implemented on top of Microsoft .NET on Windows and MONO .NET on UNIX/Linux. The DotGrid infrastructure, along with its proposed protocols, provides a solid approach to targeting other platforms, e.g., the native C/C++ runtime. In this paper, we propose a new file transfer protocol ca...
Publication date: 2002